Abstract

Eatomics is a R-Shiny based web application that enables interactive exploration of quantitative proteomics data generated by MaxQuant software. The application contains a suite of both novel and established bioinformatics tools (mainly as R functions), wrapped into four major tabpanels, performing:

  • Quality control of the MaxQuant generated proteomics data
  • Differential protein expression analysis using statistical method called limma
  • Gene-pathway enrichment analysis using single-sample Gene Set Enrichment Analysis (ssGSEA)
  • Differential analysis of single/sample enriched pathways

Featuring quick and easy interpretation of the MaxQuant data even for researchers with limited programming background. Additionally, the interactive clinical feature selection developed, enables the user to explore the differentially expressed protein amongst the clinical parameter in one-click.

1. Pre-requisites

A recent version of Rstudio

1.0 Package Dependencies

As our Eatomics shiny application integrates several data analysis tools using R, a series of R packages are run on the background as mentioned in ´Details/Imports´: in the Package Description file.

1.1 Input files:

The app requires two file inputs:

  • A tabular input file (e.g. proteinGroups.txt) as generated by MaxQuant software. Each row represents a protein name, corresponding gene name etc and each column contains details on the protein intensities such as LFQ, iBAQ intensities of m number of samples.
  • A tab separated m x n matrix (e.g. clinicaldata.txt), containing the clinical data of m number of samples with n number of clinical observations.

The name of the samples shall be identical in both input files. Also, represent character variables as ‘factor’ and integers as ‘numeric’. One limitation is, ordinal values are treated as numeric. text

To see an example on the input files, please refer example files

2. Layout

Overview of the tab panels

2.0 Load and Prepare

The tab enables the upload and preparation of both input files. The tab uses set of novel R functions (in italics) to filter and visualize the quality of the data. Additionally, also adapts a series of existing data filtering functions from several other R packages such as DEP. The tabpanel provides an overview on the data quality and enables the user to filter and prepare data for downstream expression analysis.

The following subsections explain the tab in detail.

2.0.1 Loading of MaxQuant input file

To begin the analysis the user has to upload the MaxQuant file (e.g.proteinGroups.txt), which can be done in two ways:

  • Either browse the file from your local folder
    OR
  • Select a file from the server to try the example file that we provide

2.0.2 Loading of Clinical Data

The clinical data input file (e.g clinicaldata.txt) is also uploaded in the similar way as explained above.

2.0.3 Data Filtering Widgets

The following section explains the details of the widgets used in the tab.

  • The intensity widget uses the selectProteinData function enabling the user to select an intensity type such as LFQ (Label-free quantification) or iBAQ (Intensity Based Absolute Quantification), to be considered for succeeding differential expression analysis.
  • The filter widget uses the FilterProteins function allowing the user to select the minimum number of samples where a protein must be detected, as a first step to avoid proteins with many missing values across the samples. As default, the function also removes the reverse peptides and potential contaminants
  • In the protein names widget, the user can choose either
    • the checkForIsoforms function,enabling one to check the isoform that dominates the others by showing more than 80% of overall expression of that protein family, or,
    • the make_unique {DEP} function (adapted from DEP, R package), providing unique names to the proteins referring the corresponding gene names.
  • The impute widget uses the impute {DEP} function to impute missing values in a proteomics dataset using multiple imputation methods

  • The exclude column widget allows the user to exclude samples, especially if any outliers found after conducting initial quality analysis such as PCA, thus avoiding the sample from the consecutive steps in differential analysis.

Screenshot of the widgets

2.0.4 Visualization on Quality Control

Our Eatomics app wraps several functions to visualize the quality control of the uploaded proteinGroups.txt file.

Screenshot on the visualization plots developed anew are only provided in the documentation. Others, have been majorly adapted from DEP package. text

2.0.4.0 Interactive PCA plot

A common method of dimensionality reduction is principal component analysis (PCA). Inherently, PCA calculates axes of most variation (principal components) within the expression data. A common assumption is that a plot along the axes of most variation will segregate all samples/patients into groups under investigation.

Fig.1.Screenshot of the PCA plot widget

Fig.1.Screenshot of the PCA plot widget

2.0.4.1 Low abundance transcription factors

Gene lists for the families GATA, JUN, SH2, SOX and TBX were downloaded from https://www.genenames.org on 23.03.2017. Please note, that in fact not all members of these families are necessarily low-abundant.

Fig.2.Screenshot of the Low abundance transcription factors plot widget

Fig.2.Screenshot of the Low abundance transcription factors plot widget

2.0.4.2 Protein Numbers

Protein numbers describes the count of distinct proteins or isoforms per sample. The plot is generated by the plot_numbers() function from the DEP package which was adjusted to work without experimental design information.

2.0.4.3 Sample Coverage

Sample coverage describes how many proteins are shared across samples. I.e. there will be proteins that were detected in only one sample (top of the plot) or proteins detected in all samples (bottom of the plot). The plot is generated by the plot_coverage() function from the DEP package which was adjusted to work without experimental design information.

2.0.4.4 Missing Values - Heatmap

The missing values heatmap shows the distribution of missing values across samples. White areas mark missing values, while black is a valid value. If samples cluster based on their pattern of missing values, caution should be taken in further analysis as this pattern may perturb, e.g., differential expression analysis. In well-founded cases, samples should be removed from the analysis. The plot is generated by the plot_missval() function from the DEP package.

2.0.4.5 Missing Values - Quantitative

The missing values plot visualizes the density and cumulative protein intensities with and without missing values. It can guide the selection of the missing value imputation strategy. The plot is generated by the plot_impute() function from the DEP package which was adjusted to work without experimental design information.

2.0.4.6 Imputation

The missing value imputation plot shows the density of missing and valid values over range of the spectrum of intensities before and after imputation with the selected method. The plot is generated by the plot_detect() function from the DEP package which was adjusted to work without experimental design information.

2.0.4.7 Sample to sample heatmap

The sample-to-sample heatmap describes the biological and/or technical variability of the samples. It can be produced using different (dis-) similarity metrics (Pearson correlation or euclidean distance). For this plot ´r params$StSheatmapDistMetric´ was used. Formed clusters should resemble the sample groups under investigation.

Fig.3.Screenshot of Sample to Sample Heatmap

Fig.3.Screenshot of Sample to Sample Heatmap

2.0.4.8 Cumulative Protein Intensities

Protein intensities are cumulated across all samples and plotted according to their relative abundance. Colouring marks the respective quantile of the proteins. Highly abundant proteins, i.e., proteins ranked in the first quartile are colored in red and labels are specified. The top 20 ranked proteins and their cumulated intensity are given in the table to the right.

Fig.4.Screenshot of Cumulative Protein Intensities

Fig.4.Screenshot of Cumulative Protein Intensities

2.1 Differential Limma

limma (linear models for microarray data), is a commonly used R/Bioconductor software package for analyzing microarray and RNA-seq data. Limma functions fit statistical linear models and conducts differential expression analysis.
Eatomics uses limma to perform real time analysis of differentially expressed proteins amongst clinical parameters of choice. The resulting interactive visualization plot including volcanoplots (detailed below) allows a quick and detailed overview on the differetial expression analysis.

2.1.0 Limma Widgets

  • The clinical group widget allows the user to select the clinical parameter of their choice from the uploaded dataset, to perform limma analysis.

  • The compare widget shows all the available subgroups inside the selected clinical group. The user then selects two subgroups to compare and find the differential expressed genes in the selected subgroups.

  • The surrogate variable widget uses the sva{sva} function to remove the unknown source of variation such as batch effects from the expression value of the compared clinical subgroup. It is better to impute before, checking surrogate variable, as the sva function works better with non-missing values

  • The impute values provides the user the flexibility to choose the imputed MaxQuant expression values (from the Load and Prep tab)

  • The apply filters widget allows to extend filter to a second clinical parameter, where the user can narrow down the filtering to very specific subgroup of another clinical parameter of interest

  • Using threshold box widget, the user can interactively adjust for multiple testing corrected (adjusted) P-value and log FoldChange(logFC) , plotting the genes above the chosen threshold in the volcano plot

2.1.1 Visualization of Limma results

2.1.3.0 Volcano Plot

lmFit{limma} along with other liner model functions from limma performs the differential expression analysis on the selected clinical group. Later, the resulting table to top-ranked genes{limma} the volcanoplot is plotted. In the plot, the x axis plots the log-2-fold change value (`lfc´ from the table) where logFC> 0 means the genes are up-regulated and logFC<0 meaning genes down regulated in the clinical group selected first (i.e,cases in Fig.5.) and the y axis plots the -log10 of P value (´p.value´) from the gene list.

Fig.5.VolcanoPlot

Fig.5.VolcanoPlot

2.1.3.1 Up and Down regulated Data Table

The list of the genes up and down regulated (once that crosssed the given threshold) and their summary including the gene name, log-2-foldchange, P value and adjusted P value (adj.P.Val) is represented in the respective up and down regulated tables.

Fig.6.Screenshot of Up and Down regulated gene list

Fig.6.Screenshot of Up and Down regulated gene list

2.1.3.2 Detailed Description

The collapsible panel contains details on the input paramaters that were selected by the user through the limma widgets.

Fig.7.Screenshot of Detailed Description collapsible tab panel

Fig.7.Screenshot of Detailed Description collapsible tab panel

2.2 ssGSEA

The tab provides an easy to use user-interface on the established (ssGSEA) (single-sample Gene Set Enrichment Analysis) method which is an extension of GSEA. The interactive widgets in the tab are developed based on the freely accessed codes developed by Broad institute 1. The tab currently focuses only on four MSigDB(version v6.2) genesets namely, H- hallmark geneset, C2- KEGG geneset, C5- GO geneset and C1- positional geneset to calculate the enrich ment score and the user selects one geneset from these (similar to conventional GSEA).

Later, the resulting output,the .gct files with the respective geneset enrichment score will be auto-downloaded into the working directory, into the folder called EnrichmentScore.

1: Krug, K., et al., A Curated Resource for Phosphosite-specific Signature Analysis. Mol Cell Proteomics, 2019. 18(3): p. 576-593.

A message alert will be pop-up on download completion.

Screenshot of the ssGSEA widgets

Fig.8.Screenshot of ssGSEA tab panel

Fig.8.Screenshot of ssGSEA tab panel

2.3 Differential ssGSEA

Differential expression analysis of the pathways enriched via ssGSEA (using the previou tabpanel) in the clinical group of interest,is facilitated by the tab. The pathway enrichment scores geneset is tested for differential pathway expression using t.test. The test compares the enriched pathways differential expressed among two clinical conditions selected.

2.3.0 Differential ssGSEA widgets

  • The Gene Set Collection widget allows the user to select the geneset of his choice (from the available). On selecting the geneset, the file is auto-uploaded from EnrichmentScore folder, provided the user has saved the file using ssGSEA tab.

The score file is .gct file, a m x n matrix where m rows contain the name of the pathway from the selected geneset and n number of column represents the sample names matching to the clinical data.
Make sure the sample names (column names) format in the geneset collection matches to the sample names (PatientID) in the clinical data file. text

*Rest of the widgets in the panel follows the same pattern in the ‘differential limma’ tab

2.3.1 Visualization of differential ssGSEA

2.3.2.0 Volcano Plot

The interactive volcanoplot plots the log ratio (x-axis) vs -log10 of P value calsulated based of t.test. The bonferroni adjusted P value (adj.p< 0.05) is selected as the threshold to highlight the differentially expressed pathways.

Fig.9.Screenshot of VolcanoPlot

Fig.9.Screenshot of VolcanoPlot

The other two result formats including interactive data table and the detailed description panel lists the highlighted up and and down regulated pathways and details on the input parameters, respecively.

5. Gitlab Folder layout

Shiny R package
 
 R
    |app.R
    |helpers.R
    |run_app_function.R
 vignettes
    |About.rmd 
    |
 man
    |helpers_roxygen.R
    |
 markdown_docs
    |Tutorial.rmd
    |About.rmd
 Input Files
    |proteinGroups.txt
    |clindata.txt
 Datasets
     MSigDB
        |c1.all.v6.1.symbols.gmt
        |c2.cp.kegg.v6.1.symbols.gmt
        |c5.all.v6.1.symbols.gmt
        |h.all.v6.1.symbols.gmt
     TranscriptionFactors
        |GATA-familyHuman230318.txt
        |JUN-familyHuman230318.txt
        |SH2-familyHuman230318.txt
        |SOX-familyHuman230318.txt
        |TBX-familyHuman230318.txt
     EnrichmentScore
 Report
    |
|DESCRIPTION
    |
|NAMESPACE
    |
|README.md
    |